MMMLU (Multilingual MMLU)

Testing LLMs across 14 languages and 57 subjects — OpenAI’s professionally human-translated benchmark for multilingual knowledge and reasoning

Published: September 7, 2025

Keywords: MMMLU, Multilingual MMLU, multilingual benchmark, LLM evaluation, MMLU, OpenAI, multilingual reasoning, low-resource languages, professional translation, cross-lingual evaluation, knowledge assessment, Yoruba, Swahili, Arabic, Bengali

Introduction

Most AI benchmarks evaluate LLMs in English only — but billions of people around the world interact with AI in their native language. How do we know if a model that scores 90% on English knowledge tests can perform equally well in Arabic, Bengali, Swahili, or Yoruba?

MMMLU (Multilingual Massive Multitask Language Understanding) answers this question directly. Created by OpenAI, it takes the widely used MMLU benchmark — 57 subjects spanning elementary to professional-level knowledge — and translates the entire test set into 14 languages using professional human translators. The result is a rigorous, high-quality multilingual evaluation that exposes dramatic performance gaps between high-resource and low-resource languages.

“Relying on human translators for this evaluation increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.” — OpenAI MMMLU Dataset

graph LR
    A["MMLU<br/>(English only)<br/>57 subjects"] --> B["Limited to<br/>English-speaking<br/>evaluation"]
    B --> C["MMMLU<br/>14 languages<br/>Human-translated"]
    C --> D["True multilingual<br/>knowledge<br/>assessment"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is MMMLU?

MMMLU is a multilingual extension of the MMLU (Massive Multitask Language Understanding) benchmark. It contains the complete MMLU test set — covering 57 subjects from elementary mathematics to advanced professional topics like law, medicine, and computer science — professionally translated into 14 languages by human translators.

The benchmark ensures that translations are accurate and culturally appropriate, particularly for low-resource languages where machine translation quality is unreliable. This makes MMMLU the gold standard for evaluating whether LLMs can reason and recall knowledge across linguistic boundaries.

Languages Covered

| Language | Locale | Resource Level |
|---|---|---|
| Arabic | AR_XY | Medium |
| Bengali | BN_BD | Low |
| Chinese (Simplified) | ZH_CN | High |
| French | FR_FR | High |
| German | DE_DE | High |
| Hindi | HI_IN | Medium |
| Indonesian | ID_ID | Medium |
| Italian | IT_IT | High |
| Japanese | JA_JP | High |
| Korean | KO_KR | High |
| Portuguese (Brazil) | PT_BR | High |
| Spanish | ES_LA | High |
| Swahili | SW_KE | Low |
| Yoruba | YO_NG | Low |

Key Characteristics

| Feature | Details |
|---|---|
| Base benchmark | MMLU (57 subjects, ~14,000 test questions) |
| Languages | 14 (professional human translations) |
| Total questions | ~197,000 (14 × ~14,000) |
| Question format | Multiple-choice (4 options) |
| Evaluation | Zero-shot, chain-of-thought |
| Subjects | Elementary math to professional law, medicine, CS |
| Translation quality | Professional human translators (not machine translation) |
| License | MIT |

Who Built It?

MMMLU was created by OpenAI as part of their commitment to improving multilingual AI capabilities. The translations were commissioned using professional human translators — a deliberate choice over machine translation to ensure accuracy, especially for low-resource languages like Yoruba and Swahili.

The original MMLU benchmark that MMMLU builds upon was created by:

  • Dan Hendrycks — UC Berkeley (now Center for AI Safety)
  • Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika — UC Berkeley
  • Dawn Song, Jacob Steinhardt — UC Berkeley

MMLU was published at ICLR 2021 and quickly became one of the most widely used benchmarks in AI evaluation.

Publication and Resources

| Resource | Link |
|---|---|
| MMMLU Dataset | huggingface.co/datasets/openai/MMMLU |
| Evaluation code | github.com/openai/simple-evals |
| Original MMLU paper | arxiv.org/abs/2009.03300 |
| Community Leaderboard | Multilingual MMLU Benchmark Leaderboard |

What Skills Does It Test?

MMMLU tests the same broad spectrum of knowledge and reasoning as MMLU — but crucially, it measures whether models can perform these tasks in non-English languages. This reveals both knowledge depth and cross-lingual transfer capabilities.

graph TD
    MMMLU["MMMLU<br/>Multilingual Knowledge"] --> A["STEM<br/>Math, Physics,<br/>Computer Science"]
    MMMLU --> B["Humanities<br/>History, Philosophy,<br/>Literature"]
    MMMLU --> C["Social Sciences<br/>Economics, Law,<br/>Psychology"]
    MMMLU --> D["Professional<br/>Medicine, Law,<br/>Engineering"]
    MMMLU --> E["Cross-Lingual<br/>Transfer<br/>Same knowledge,<br/>14 languages"]
    MMMLU --> F["Low-Resource<br/>Language<br/>Understanding<br/>Yoruba, Swahili,<br/>Bengali"]

    style MMMLU fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What MMMLU Tests |
|---|---|
| Multilingual knowledge recall | Can the model access the same factual knowledge in Arabic as in English? |
| Cross-lingual reasoning | Can the model solve multi-step problems when the question is in Japanese or Hindi? |
| Low-resource language fluency | How much performance degrades for Yoruba, Swahili, and Bengali vs. high-resource languages |
| Subject breadth | 57 subjects from elementary to professional level, same questions across all languages |
| Translation robustness | Whether subtle linguistic differences affect model accuracy |
| Inclusive AI assessment | Real-world readiness for global deployment across diverse language communities |

The 57 MMLU Subject Categories (Grouped)

| Domain | Example Subjects |
|---|---|
| STEM | Abstract Algebra, Astronomy, College Mathematics, Computer Science, Electrical Engineering, Physics |
| Humanities | Formal Logic, Jurisprudence, Moral Disputes, Philosophy, Prehistory, World Religions |
| Social Sciences | Econometrics, Human Sexuality, Marketing, Public Relations, Sociology, US Foreign Policy |
| Professional | Clinical Knowledge, Medical Genetics, Professional Accounting, Professional Law, Professional Medicine |
| Other | Global Facts, Miscellaneous, Nutrition, Virology |

Current Leaderboard

The table below shows the average accuracy across all 14 languages for each model, as published in the official OpenAI MMMLU benchmark results.

Source: OpenAI Simple Evals — MMMLU Results (consulted March 28, 2026). Evaluation uses zero-shot chain-of-thought prompting.

Average Accuracy Across 14 Languages

| Rank | Model | Avg. Accuracy (%) |
|---|---|---|
| 1 | o3 (high) | 88.8 |
| 2 | o1 | 87.7 |
| 3 | o4-mini (high) | 85.2 |
| 4 | GPT-4.5 Preview | 85.1 |
| 5 | GPT-4.1 | 83.7 |
| 6 | GPT-4o (Nov 2024) | 81.4 |
| 7 | o3-mini (high) | 80.7 |
| 8 | GPT-4.1 Mini | 78.5 |
| 9 | GPT-4o Mini | 70.5 |
| 10 | GPT-4.1 Nano | 66.9 |

Performance by Language (Selected Models)

The table below reveals the language performance gap — even for the best model, accuracy ranges from 91.2% (Italian) to just 78.0% (Yoruba).

| Language | o3 (high) | o1 | GPT-4.1 | GPT-4.1 Nano |
|---|---|---|---|---|
| Italian | 91.2% | 89.7% | 86.9% | 73.4% |
| Spanish | 91.1% | 89.9% | 87.6% | 74.8% |
| Portuguese (Brazil) | 91.0% | 89.5% | 87.0% | 74.1% |
| French | 90.6% | 89.3% | 87.0% | 73.9% |
| German | 90.5% | 89.0% | 85.5% | 72.2% |
| Arabic | 90.4% | 89.0% | 84.4% | 65.9% |
| Hindi | 89.8% | 88.3% | 84.2% | 62.9% |
| Indonesian | 89.8% | 88.6% | 85.9% | 71.4% |
| Chinese (Simplified) | 89.3% | 88.9% | 86.1% | 71.0% |
| Korean | 89.3% | 88.2% | 84.9% | 67.9% |
| Japanese | 89.0% | 88.9% | 85.6% | 69.0% |
| Bengali | 87.8% | 87.3% | 82.7% | 58.3% |
| Swahili | 86.0% | 85.4% | 79.5% | 56.6% |
| Yoruba | 78.0% | 75.4% | 64.7% | 45.5% |

Key takeaways:

  • Massive gap between high-resource and low-resource languages — o3 (high) scores 91.2% on Italian but only 78.0% on Yoruba, a 13+ point gap (quantified in the sketch after this list)
  • Yoruba is the hardest language for every model — GPT-4.1 Nano drops to 45.5%, closer to random chance (25%) than to its high-resource scores
  • Reasoning models (o-series) lead the rankings — o3 (high) at 88.8% average beats the best non-reasoning model, GPT-4.5 Preview, at 85.1%
  • Smaller models suffer disproportionately on low-resource languages — GPT-4.1 Nano loses 28 points going from Italian (73.4%) to Yoruba (45.5%)
  • European languages cluster together — Italian, Spanish, Portuguese, French, and German all score within about one point of each other
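To make these gaps concrete, here is a minimal Python sketch that recomputes each model's best-to-worst spread from values transcribed from the table above (only four of the 14 languages are included, so the spreads are illustrative rather than exhaustive):

# Per-language accuracy (%) transcribed from the table above (subset of languages).
scores = {
    "o3 (high)":    {"Italian": 91.2, "Spanish": 91.1, "Swahili": 86.0, "Yoruba": 78.0},
    "o1":           {"Italian": 89.7, "Spanish": 89.9, "Swahili": 85.4, "Yoruba": 75.4},
    "GPT-4.1":      {"Italian": 86.9, "Spanish": 87.6, "Swahili": 79.5, "Yoruba": 64.7},
    "GPT-4.1 Nano": {"Italian": 73.4, "Spanish": 74.8, "Swahili": 56.6, "Yoruba": 45.5},
}

for model, per_lang in scores.items():
    best = max(per_lang, key=per_lang.get)    # strongest language in this subset
    worst = min(per_lang, key=per_lang.get)   # weakest language in this subset
    gap = per_lang[best] - per_lang[worst]
    print(f"{model}: {best} {per_lang[best]:.1f}% -> {worst} {per_lang[worst]:.1f}% (gap {gap:.1f} pts)")

Running this shows the pattern the takeaways describe: the gap widens steadily as models shrink, from about 13 points for o3 (high) to nearly 30 points for GPT-4.1 Nano.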

Where to Explore the Benchmark

Dataset and Evaluation

| Resource | Description | Link |
|---|---|---|
| MMMLU Dataset | Full 197K-question dataset across 14 languages on Hugging Face | huggingface.co/datasets/openai/MMMLU |
| Official Results | Benchmark results with scores for all models and languages | github.com/openai/simple-evals |
| Community Leaderboard | Interactive Hugging Face Space for exploring multilingual results | Multilingual MMLU Leaderboard |

Load the Dataset

from datasets import load_dataset

# Load all languages
dataset = load_dataset("openai/MMMLU", split="test")

# Load a specific language
dataset_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
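Continuing the snippet above, each record carries the question, the four answer options, the gold letter, and the subject. The column names used below (Question, A, B, C, D, Answer, Subject) follow the dataset card; treat them as an assumption and confirm against .column_names before relying on them:

# Confirm the schema first — the field names below are assumed from the dataset card
print(dataset_fr.column_names)

row = dataset_fr[0]
print(row["Subject"])                  # MMLU subject identifier
print(row["Question"])                 # question text, in French
for letter in ["A", "B", "C", "D"]:
    print(f"{letter}. {row[letter]}")  # the four answer options
print("Gold answer:", row["Answer"])   # one of "A", "B", "C", "D"

# Narrow to a single subject for targeted evaluation (subject naming assumed)
law_fr = dataset_fr.filter(lambda r: r["Subject"] == "professional_law")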

Understanding the Metric

Accuracy (Zero-Shot Chain-of-Thought)

Models are evaluated using zero-shot chain-of-thought prompting — no few-shot examples, no role-playing prompts. The model receives a multiple-choice question in the target language and must select the correct answer (A, B, C, or D).

| Approach | Description |
|---|---|
| Zero-shot | No examples provided — tests raw capability |
| Chain-of-thought | Model can reason step by step before answering |
| Per-language scoring | Accuracy computed separately for each of the 14 languages |
| Average score | Mean accuracy across all 14 languages |
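As an illustration, here is a minimal sketch of what such a scoring loop might look like, reusing the column names assumed earlier and a generic generate(prompt) callable that wraps whatever model is under test; the exact prompt template used by simple-evals may differ:

import re

def build_prompt(row):
    # Zero-shot chain-of-thought: no worked examples, just an instruction
    # to reason first and end with a parseable final-answer line.
    options = "\n".join(f"{letter}. {row[letter]}" for letter in "ABCD")
    return (
        f"{row['Question']}\n\n{options}\n\n"
        "Think step by step, then finish with a single line of the form "
        "'Answer: X', where X is A, B, C, or D."
    )

def accuracy(rows, generate):
    # Fraction of questions where the extracted letter matches the gold answer.
    correct = 0
    for row in rows:
        reply = generate(build_prompt(row))
        match = re.search(r"Answer:\s*([ABCD])", reply)
        correct += bool(match and match.group(1) == row["Answer"])
    return correct / len(rows)

Scoring each locale this way yields the per-language accuracies; the headline MMMLU number is their mean across all 14 languages.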

Why Professional Human Translation Matters

graph LR
    A["Machine Translation<br/>Errors in low-resource<br/>languages"] --> B["Unreliable<br/>benchmark<br/>scores"]
    C["Professional Human<br/>Translation<br/>(MMMLU approach)"] --> D["Accurate, culturally<br/>appropriate<br/>evaluation"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333

Machine-translated benchmarks often contain errors that disproportionately affect low-resource languages, making it unclear whether poor performance reflects model weakness or translation quality. By using professional human translators, MMMLU isolates the variable being tested: the model’s actual multilingual capability.

Why MMMLU Matters

graph LR
    A["English-only<br/>benchmarks"] --> C["MMMLU<br/>14 languages<br/>human-translated"]
    B["Machine-translated<br/>benchmarks<br/>(unreliable)"] --> C
    C --> D["True multilingual<br/>AI performance<br/>measurement"]
    C --> E["Exposes gaps<br/>for underserved<br/>language communities"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. Exposes the multilingual gap — Even the best model drops 13+ points between its strongest and weakest language, revealing that “multilingual” models are far from language-equitable
  2. High-quality human translations — Professional translators ensure the benchmark tests the model, not the translation quality
  3. Low-resource language visibility — Yoruba, Swahili, and Bengali scores expose the real-world readiness (or lack thereof) of LLMs for billions of speakers
  4. 57-subject breadth — Tests knowledge and reasoning across the full academic spectrum, not just narrow domains
  5. Practical deployment signal — Organizations deploying AI globally need to know exactly how much performance they lose in each language


Conclusion

MMMLU sets the standard for multilingual AI evaluation:

  • 14 languages, 57 subjects, ~197,000 questions — the most comprehensive professionally translated multilingual knowledge benchmark
  • Professional human translators ensure accuracy, especially for low-resource languages where machine translation fails
  • The best model (o3 high) averages 88.8% — but drops to just 78.0% on Yoruba, exposing a 13+ point multilingual gap
  • Smaller models suffer disproportionately — GPT-4.1 Nano scores 73.4% on Italian but only 45.5% on Yoruba, a 28-point drop
  • Low-resource languages need urgent attention — Yoruba and Swahili lag high-resource languages by roughly 5 to 28 percentage points, depending on the model

As AI goes global, MMMLU provides the essential reality check: how well does your model actually work for the world’s diverse language communities? For most languages, the answer is “significantly worse than English” — and for low-resource languages, the gap is alarming.
